Adapting code for NVFlare

NVIDIA currently provides plenty of examples for different use cases, languages, and models. To speed up the task of reworking your code, we also provide our ihd-federated job in the /nvflare/ihd-federated folder for classification in MONAI; it should also work with plain PyTorch, as MONAI is built on top of PyTorch.

Config

First, take a look at the config subfolder. It contains three config files (see the folder layout sketch after this list):

  1. config_fed_client.json - the client configuration
  2. config_fed_server.json - the server-specific setup
  3. config_train.json - a supplemental config for the client; it is not part of the standard setup and may be omitted, but omitting it requires adjusting SupervisedMonaiRsnaLearner
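
For orientation, the job folder layout looks roughly like this, assembled from the files discussed in this section:

    ihd-federated/
        config/
            config_fed_client.json
            config_fed_server.json
            config_train.json
        custom/
            custom_persistor.py
            mlflow_receiver.py
            supervised_learner.py
            supervised_monai_rsna_learner.py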

In general, you want to use the prebuilt objects NVIDIA supplies. However, the architecture also allows you to take any of these objects, inherit from it, and create a derivative that behaves differently. This way you can write your own Learner (essentially the training code) or analytics streamer. The objects fall into several classes: executors, components, workflows, etc.

When it comes to config_fed_server.json, every controller should have a name, which is the name of the class. For custom components (as opposed to prebuilt ones), please also add a path, relative to the custom folder. Say we have a learner called SupervisedMonaiRsnaLearner in custom/supervised_monai_rsna_learner.py. The path will be:

   "path": "supervised_monai_rsna_learner.SupervisedMonaiRsnaLearner",

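A complete component entry might then look like the following sketch; the component id and the constructor arguments are assumptions for illustration and must match what your learner's __init__ actually accepts:

    {
        "id": "learner",
        "path": "supervised_monai_rsna_learner.SupervisedMonaiRsnaLearner",
        "args": {
            "train_config_filename": "config/config_train.json",
            "aggregation_epochs": 1
        }
    }
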
Another thing to note is that the model to be trained is also specified here, inside the persistor component. The model specified in the config and the one used in the code must match!
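
As a rough illustration, a persistor entry could look as follows; the component id, the class name inside custom_persistor.py, and the network class with its arguments are all assumptions and depend on your task:

    {
        "id": "persistor",
        "path": "custom_persistor.CustomPersistor",
        "args": {
            "model": {
                "path": "monai.networks.nets.DenseNet121",
                "args": {"spatial_dims": 2, "in_channels": 3, "out_channels": 2}
            }
        }
    }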

In config_fed_server.json we can also see other settings not tied to components, executors or workflows:

  "format_version": 2,
"min_clients": 4,
"num_rounds": 20,
"server": {
"heart_beat_timeout": 600
}

These are tied to the training: the minimum number of clients whose results are aggregated together, how many rounds (one round spans sending the model out to aggregating the updates) should be performed, the server heartbeat timeout, etc.

In our case, config_train.json sums up the training configuration for the clients. At the moment the configs are tied together, i.e. the same for every site; we plan to change this soon.
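
For illustration, such a training config typically holds hyperparameters and data locations. The keys below are hypothetical; they must match whatever SupervisedMonaiRsnaLearner actually reads:

    {
        "learning_rate": 1e-4,
        "batch_size": 8,
        "aggregation_epochs": 1,
        "data_dir": "/data/ihd"
    }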

Custom

To fully utilize our solution, you also need to understand the custom folder, which contains the code that is actually executed.

custom_persistor.py is a utility class that extends the persistor so the model is also saved every epoch on every client; you can then retrieve the model manually. If this is not necessary, we suggest using PTFileModelPersistor instead.

mlflow_receiver.py is a workaround to stream data to MLflow instead of TensorBoard. Here it is necessary to change the values on lines 87-89:

    # Enter details of your AzureML workspace
    subscription_id = '08217dea-11b6-4ce7-b44e-b82c6345f2a2'
    resource_group = 'fl-mvp'
    workspace = 'central-workspace'

and line 97:

    mlflow.set_experiment("fl-ihd")

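For context, these values are typically wired together roughly as follows. This is a sketch assuming the receiver uses the azureml-mlflow plugin; the actual code in mlflow_receiver.py may differ:

    from azureml.core import Workspace
    import mlflow

    # Connect to the AzureML workspace defined on lines 87-89
    ws = Workspace(subscription_id=subscription_id,
                   resource_group=resource_group,
                   workspace_name=workspace)

    # Point MLflow at the AzureML tracking server, then pick the experiment
    mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
    mlflow.set_experiment("fl-ihd")
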
supervised_learner.py is the base learner class used for classification. Few adjustments should be needed, likely none at all, depending on your use case; the local_train/local_validate functions are the most likely candidates for an update.

supervised_monai_rsna_learner.py is the class that encapsulates the training, validation, and data-loading code. Most of your logic will go here.
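
To make the split concrete, a derived learner typically overrides the local training loop along these lines. This is a minimal sketch: apart from the file and class names mentioned above, everything here (the base-class interface, attribute names, batch keys) is an assumption, not the repository's actual code:

    from supervised_learner import SupervisedLearner

    class SupervisedMonaiRsnaLearner(SupervisedLearner):
        # Hypothetical override; self.model, self.optimizer, self.criterion,
        # self.device, and self.aggregation_epochs are assumed to be set up
        # elsewhere (e.g. in an initialize() method).
        def local_train(self, fl_ctx, train_loader, abort_signal):
            self.model.train()
            for epoch in range(self.aggregation_epochs):
                for batch in train_loader:
                    # Stop promptly if the server aborts the round
                    if abort_signal.triggered:
                        return
                    inputs = batch["image"].to(self.device)
                    labels = batch["label"].to(self.device)
                    self.optimizer.zero_grad()
                    loss = self.criterion(self.model(inputs), labels)
                    loss.backward()
                    self.optimizer.step()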

We will add more material to this section as questions arise. In the meantime, NVIDIA provides substantial documentation.